A theory of systems with long-range correlations based on the consideration of binary N-step
Markov chains is developed. In our model, the conditional probability that the i-th symbol in the
chain equals zero (or unity) is a linear function of the number of unities among the preceding N
symbols. The model allows exact analytical treatment. The correlation and distribution functions
as well as the variance of the number of symbols in words of arbitrary length L are obtained
analytically and numerically. A self-similarity of the studied stochastic process is revealed and the
similarity transformation of the chain parameters is presented. The diffusion equation governing
the distribution function of the L-words is explored. If the persistent correlations are not extremely
strong, the distribution function is shown to be Gaussian with a variance that depends nonlinearly
on L. The applicability of the developed theory to coarse-grained written and DNA
texts is discussed.
I. INTRODUCTION

The problem of long-range correlations is one of the topics of current research in statistical mechanics.
Stochastic processes with strong correlations have been observed in numerous systems. Examples include natural
languages, DNA sequences, etc.

One of the efficient methods to investigate correlated systems is based on a decomposition of the space of states
into a finite number of parts labelled by definite symbols. This procedure, referred to as coarse graining, is accompanied
by the loss of short-range memory between symbols but does not damage the robust invariant
statistical properties of the long-range correlated sequences. In terms of the power spectrum, one loses only the
short-wave part of the spectrum. The most frequently used method of decomposition is based on the introduction
of two parts of the phase space. In other words, it consists in mapping the two parts of states onto two symbols, say
0 and 1. Thus, the problem is reduced to the investigation of the statistical properties of binary sequences. This
method is applicable to the examination of both discrete and continuous systems.

One of the ways to get a correct insight into the nature of correlations in a system consists in the ability to
construct a mathematical object (for example, a correlated sequence of symbols) possessing the same statistical
properties as the initial system. There are many algorithms to generate long-range correlated sequences: the inverse
Fourier transform, the expansion-modification Li method, the Voss procedure of consequent random addition,
the correlated Lévy walks, etc. We believe that, among the above-mentioned methods, using the Markov
chains is one of the most important. We would like to demonstrate this statement in the present paper.
In the next sections, the statistical properties of the binary many-step Markov chain are examined. In spite of
the long history of studying Markov sequences, the concrete expressions for
the variance of sums of random variables in such strings have not been obtained yet. Our model operates with
two parameters governing the conditional probability of the discrete Markov process, specifically with the memory
length N and the correlation parameter $\mu$. The correlation and distribution functions as well as the variance D,
which depends nonlinearly on the length L of a word, are derived analytically and calculated numerically. The nonlinearity
of the D(L) function reflects the strong correlations in the system. The evolved theory is applied to coarse-grained
written texts and dictionaries, and to DNA strings as well.



II. FORMULATION OF THE PROBLEM 



A. Markov Processes 



Let us consider a stationary binary sequence of symbols, $a_i \in \{0, 1\}$. To determine the N-step Markov chain we
have to introduce the conditional probability $P(a_i = 0 \mid a_{i-N}, a_{i-N+1}, \dots, a_{i-1})$ that a definite symbol
(for example, zero) follows the symbols $a_{i-N}, a_{i-N+1}, \dots, a_{i-1}$. Thus, it is necessary to define $2^N$ values of the P-function
corresponding to each possible configuration of the symbols $a_{i-N}, a_{i-N+1}, \dots, a_{i-1}$. We suppose that the P-function
has the form

$$P(a_i = 0 \mid a_{i-N}, a_{i-N+1}, \dots, a_{i-1}) = \frac{1}{N} \sum_{k=1}^{N} f(a_{i-k}, k). \tag{1}$$

It is reasonable to assume that the function f decreases as the distance k between the symbols
$a_{i-k}$ and $a_i$ in the Markov chain increases. However, for the sake of simplicity we consider here a step-like memory function
$f(a_{i-k}, k)$ independent of the second argument k. As a result, our model is characterized by three parameters only,
specifically by f(0), f(1), and N:

$$P(a_i = 0 \mid a_{i-N}, a_{i-N+1}, \dots, a_{i-1}) = \frac{1}{N} \sum_{k=1}^{N} f(a_{i-k}). \tag{2}$$

Note that the probability P in Eq. (2) depends on the numbers of symbols 0 and 1 in the N-word but is independent
of the arrangement of the elements $a_{i-k}$. We also suppose that

$$f(0) + f(1) = 1. \tag{3}$$

This relation provides the statistical equality of the numbers of symbols zero and unity in the Markov chain under
consideration. In other words, the chain is non-biased. Indeed, taking into account Eqs. (2) and (3) and the sequence
of equations

$$P(a_i = 1 \mid a_{i-N}, \dots, a_{i-1}) = 1 - P(a_i = 0 \mid a_{i-N}, \dots, a_{i-1}) = \frac{1}{N} \sum_{k=1}^{N} f(\tilde a_{i-k}) = P(a_i = 0 \mid \tilde a_{i-N}, \dots, \tilde a_{i-1}), \tag{4}$$

one can see the symmetry with respect to the interchange $\tilde a_i \leftrightarrow a_i$ in the Markov chain. Here $\tilde a_i$ is the symbol opposite
to $a_i$, $\tilde a_i = 1 - a_i$. Therefore, the probabilities of occurrence of the words $(a_1, \dots, a_L)$ and $(\tilde a_1, \dots, \tilde a_L)$ are equal to each
other for any word length L. At L = 1 this yields equal average probabilities of occurrence of 0 and 1 in the chain.

Taking into account the symmetry of the conditional probability P with respect to a permutation of the symbols $a_i$ (see
Eq. (2)), we can simplify the notation and introduce the conditional probability $p_k$ of occurrence of the symbol zero after
an N-word containing k unities, e.g., after the word $(11\dots1\,00\dots0)$,

$$p_k = P(a_{N+1} = 0 \mid \underbrace{11\dots1}_{k}\,\underbrace{00\dots0}_{N-k}) = \frac{1}{2} + \mu\left(1 - \frac{2k}{N}\right), \tag{5}$$

with the correlation parameter $\mu$ being defined by the relation

$$\mu = f(0) - \frac{1}{2}. \tag{6}$$

We focus our attention on the region of $\mu$ determined by the persistence inequality $0 < \mu < 1/2$. Nevertheless, the
major part of our results is valid for the anti-persistent region $-1/2 < \mu < 0$ as well.

A similar rule for the production of an N-word $(a_1, \dots, a_N)$ following a word $(a_0, a_1, \dots, a_{N-1})$ was suggested
in an earlier work. However, in that model the conditional probability $p_k$ of occurrence of the symbol $a_N$ does not depend
on the previous ones.
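
The rule of Eq. (5) can be turned into a short generator. The following is only a sketch under stated assumptions: the function name `generate_chain`, the seed handling, and the parameter values are illustrative and not taken from the paper; the chain is started from N uncorrelated symbols.

```python
import random
from collections import deque


def generate_chain(length, N, mu, seed=0):
    """Generate a binary N-step Markov chain following Eq. (5) (illustrative helper)."""
    rng = random.Random(seed)
    # Start from N uncorrelated symbols (0 or 1 with probability 1/2 each).
    window = deque((rng.randint(0, 1) for _ in range(N)), maxlen=N)
    ones = sum(window)  # number k of unities in the current N-word
    chain = list(window)
    while len(chain) < length:
        p_zero = 0.5 + mu * (1.0 - 2.0 * ones / N)  # p_k of Eq. (5)
        symbol = 0 if rng.random() < p_zero else 1
        ones += symbol - window[0]  # the oldest symbol leaves the memory window
        window.append(symbol)       # maxlen=N drops window[0] automatically
        chain.append(symbol)
    return chain[:length]


# Assumed, purely illustrative parameters.
chain = generate_chain(length=100_000, N=20, mu=0.3)
density = sum(chain) / len(chain)
same_as_previous = sum(a == b for a, b in zip(chain, chain[1:])) / (len(chain) - 1)
```

For a persistent chain ($\mu > 0$) the density of unities should stay close to 1/2, while neighbouring symbols should coincide more often than in an uncorrelated sequence.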



B. Statistical characteristics of the chain 



In order to investigate the statistical properties of the Markov chain, we consider the distribution $W_L(k)$ of the
words of definite length L by the number k of unities in them,

$$k_i(L) = \sum_{l=1}^{L} a_{i+l}, \tag{7}$$

and the variance of k,

$$D(L) = \overline{k^2} - \bar{k}^2, \tag{8}$$

where

$$\overline{f(k)} = \sum_{k=0}^{L} f(k)\, W_L(k). \tag{9}$$

If $\mu = 0$, one arrives at the known result for the non-correlated Brownian diffusion,

$$D(L) = L/4. \tag{10}$$

We will show that the distribution function $W_L(k)$ for the sequence determined by Eq. (5) (with the parameter $\mu$ nonzero but not
extremely close to 1/2) is Gaussian with the variance D(L) nonlinearly dependent on L.
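
As a hedged illustration of Eqs. (7) and (8), the variance D(L) can be estimated from a single long sequence by sliding a window of length L over it. The helper name `variance_of_counts` and the use of overlapping windows are assumptions of this sketch, not a procedure stated in the paper.

```python
import random


def variance_of_counts(seq, L):
    """Estimate D(L) of Eq. (8) from all overlapping L-words of seq."""
    k = sum(seq[:L])  # k_i(L) of Eq. (7) for the first window
    total, total_sq, count = k, k * k, 1
    for i in range(L, len(seq)):
        k += seq[i] - seq[i - L]  # slide the window by one symbol
        total += k
        total_sq += k * k
        count += 1
    mean = total / count
    return total_sq / count - mean * mean


# Uncorrelated test sequence (mu = 0): Eq. (10) predicts D(L) close to L/4.
rng = random.Random(1)
uncorrelated = [rng.randint(0, 1) for _ in range(200_000)]
d10 = variance_of_counts(uncorrelated, 10)
```

For the uncorrelated sequence the estimate `d10` should lie near 10/4, in line with Eq. (10).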



C. Main equation 



For the stationary Markov chain, the probability $b(a_1 a_2 \dots a_N)$ of occurrence of a certain word $(a_1, a_2, \dots, a_N)$ satisfies
the equation

$$b(a_1 \dots a_N) = \sum_{a=0,1} b(a\, a_1 \dots a_{N-1})\, P(a_N \mid a, a_1, \dots, a_{N-1}). \tag{11}$$

Thus, we have $2^N$ homogeneous algebraic equations for the $2^N$ probabilities b of occurrence of the words and the normalization
equation $\sum b = 1$. The set of equations can be essentially simplified owing to the following statement.

Proposition ♠: The probability $b(a_1 a_2 \dots a_N)$ depends on the number k of unities in the N-word only and is
independent of the arrangement of symbols in the word $(a_1, a_2, \dots, a_N)$.

This statement, illustrated by Fig. 1, is valid owing to the chosen simple model (2), (5) of the Markov chain. It can
be easily verified directly by substituting the solution obtained below into the set of Eqs. (11). Proposition ♠
leads to the very important property of isotropy: any word $(a_1, a_2, \dots, a_L)$ appears with the same probability as the
inverted one, $(a_L, a_{L-1}, \dots, a_1)$.
FIG. 1: The probability of occurrence of a word $(a_1, a_2, \dots, a_N)$ vs its number z in the binary code, $z = \sum_{i=1}^{N} a_i \cdot 2^{i-1}$, for N = 8,
$\mu$ = 0.4.



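Let us apply the set of Eqs. (11) to the word consisting of k unities followed by N - k zeros:

$$b(\underbrace{1\dots1}_{k}\underbrace{0\dots0}_{N-k}) = b(0\underbrace{1\dots1}_{k}\underbrace{0\dots0}_{N-k-1})\, p_k + b(1\underbrace{1\dots1}_{k}\underbrace{0\dots0}_{N-k-1})\, p_{k+1}. \tag{12}$$

By Proposition ♠, the two words on the right-hand side have the probabilities b(k) and b(k+1), respectively.
This yields the recursion relation for $b(k) = b(\underbrace{1\dots1}_{k}\underbrace{0\dots0}_{N-k})$,

$$b(k) = \frac{1 - p_{k-1}}{p_k}\, b(k-1) = \frac{N - 2\mu(N - 2k + 2)}{N + 2\mu(N - 2k)}\, b(k-1). \tag{13}$$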



III. DISTRIBUTION FUNCTION OF L-WORDS



In this section we investigate the statistical properties of the Markov chain, specifically, the distribution of the
words of definite length L by the number k of unities in them. The length L can also be interpreted as the number
of jumps of some particle over an integer-valued 1-D lattice or as the time of the diffusion imposed by the Markov
chain under consideration. The form of the distribution function depends essentially on the relation between
the word length L and the memory length N. Therefore, we start our study with the simplest case, L = N.



A. Statistics of N-words

The value b(k) is the probability that an N-word contains k unities with a definite order of symbols $a_i$. Therefore,
the probability $W_N(k)$ that an N-word contains k unities with arbitrary order of symbols $a_i$ is b(k) multiplied by the
number $C_N^k = N!/k!(N-k)!$ of different permutations of k unities in the N-word,

$$W_N(k) = C_N^k\, b(k). \tag{14}$$

Combining Eqs. (13) and (14), we get

$$W_N(k) = W_N(0)\, C_N^k\, \frac{\Gamma(n+k)\,\Gamma(n+N-k)}{\Gamma(n)\,\Gamma(n+N)} \tag{15}$$

with the parameter n defined by

$$n = \frac{N(1 - 2\mu)}{4\mu}. \tag{16}$$

The normalization constant $W_N(0)$ can be obtained from the equality

$$\sum_{k=0}^{N} C_N^k\, b(k) = 1. \tag{17}$$

Note that the distribution $W_N(k)$ is an even function of the variable $\kappa = k - N/2$,

$$W_N(N - k) = W_N(k). \tag{18}$$

This fact is a direct consequence of the above-mentioned statistical equivalence of zeros and unities in the Markov chain
under consideration. Let us analyze the distribution function $W_N(k)$ for different relations between the parameters
N and $\mu$.
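
A minimal numerical sketch of Eqs. (15)-(17) may help to see the different regimes. It evaluates $W_N(k)$ in logarithmic form to avoid overflow of the Gamma functions; the function names and parameter values are assumptions made for illustration only.

```python
from math import comb, exp, lgamma, log


def n_parameter(N, mu):
    """Parameter n of Eq. (16)."""
    return N * (1.0 - 2.0 * mu) / (4.0 * mu)


def word_distribution(N, mu):
    """W_N(k) of Eq. (15), normalized according to Eq. (17)."""
    n = n_parameter(N, mu)
    # Constant factors W_N(0), Gamma(n) and Gamma(n + N) cancel in the normalization.
    log_w = [log(comb(N, k)) + lgamma(n + k) + lgamma(n + N - k) for k in range(N + 1)]
    shift = max(log_w)
    weights = [exp(v - shift) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


# Weak persistence (n = 200): the variance should be close to N / (4 (1 - 2 mu)).
W = word_distribution(N=100, mu=0.1)
mean = sum(k * w for k, w in enumerate(W))
D_N = sum((k - mean) ** 2 * w for k, w in enumerate(W))

# n = 1 for N = 20 requires mu = N / (2 N + 4); the distribution is then flat.
flat = word_distribution(N=20, mu=20 / 44)
```

The same function can be called with smaller n to follow the crossover from a bell-shaped to a U-shaped distribution.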



1. Limiting case of weak persistence, $n \gg 1$

In the absence of correlations, $\mu \to 0$, Eq. (15) and the Stirling formula yield the Gaussian distribution at $k, N, N - k \gg 1$.
In the most interesting case of not too strong persistence, $n \gg 1$, one can also obtain the Gaussian form for
the distribution function,

$$W_N(k) = \frac{1}{\sqrt{2\pi D(N)}} \exp\left\{ -\frac{(k - N/2)^2}{2D(N)} \right\}, \tag{19}$$

with the $\mu$-dependent variance,

$$D(N) = \frac{N}{4(1 - 2\mu)}. \tag{20}$$

Equation (19) says that the N-words with equal numbers of zeros and unities, k = N/2, are the most probable. Note
that the persistence results in an increase of the variance D(N) with respect to its value N/4 at $\mu = 0$. In other
words, the persistence is conducive to the intensification of the diffusion. The inequality $n \gg 1$ provides $D(N) \ll N^2$.
Therefore, despite the increase of D(N), the fluctuations of (k - N/2) of the order of N are exponentially small.

2. Intermediate case, $n \sim 1$

If the parameter n is an integer of the order of unity, the distribution function $W_N(k)$ is a polynomial of degree
2(n - 1). In particular, at n = 1, the function $W_N(k)$ is constant,

$$W_N(k) = \frac{1}{N + 1}. \tag{21}$$

At n > 1, $W_N(k)$ has a maximum at the middle of the interval [0, N].

3. Limiting case of strong persistence, $n \ll 1$

In this situation, $W_N(k)$ assumes its maximal values at k = 0 and k = N,

$$W_N(1) = W_N(0)\, \frac{nN}{N - 1} \ll W_N(0). \tag{22}$$

Formula (22) describes the sharp decrease of $W_N(k)$ as k changes from 0 to 1 (and from N to N - 1). Then, at
1 < k < N/2, the function $W_N(k)$ decreases more slowly with an increase in k,

$$W_N(k) = W_N(0)\, \frac{nN}{k(N - k)}. \tag{23}$$

At k = N/2, the probability $W_N(k)$ achieves its minimal value,

$$W_N\left(\frac{N}{2}\right) = W_N(0)\, \frac{4n}{N}. \tag{24}$$

The values $W_N(0) = W_N(N)$ are nearly 1/2 to a logarithmic accuracy.

The evolution of the distribution function $W_N(k)$ from the Gaussian form to the inverse one with a decrease of the
parameter n is shown in Fig. 2. Below we restrict ourselves to the case of weak persistence, $n \gg 1$.

Formulas (19) and (20) describe the statistical properties of L-words for the fixed "diffusion time" L = N. It is
necessary to look into the distribution function $W_L(k)$ for the general situation, $L \neq N$. We start the analysis with
L < N.

B. Statistics of L-words with L < N

1. Distribution function $W_L(k)$

The distribution function $W_L(k)$ at L < N can be given as

$$W_L(k) = \sum_{i=k}^{k+N-L} b(i)\, C_L^k\, C_{N-L}^{i-k}. \tag{25}$$

This equation follows from the consideration of N-words consisting of two parts,

$$(\underbrace{a_1, \dots, a_{N-L}}_{i-k\ \text{unities}},\ \underbrace{a_{N-L+1}, \dots, a_N}_{k\ \text{unities}}). \tag{26}$$

The total number of unities in this word is i. The right-hand part of the word (the L-sub-word) contains k unities. The
remaining i - k unities are situated within the left-hand part of the word (the (N - L)-sub-word). The multiplier $C_L^k C_{N-L}^{i-k}$
in Eq. (25) takes into account all possible permutations of the symbols "1" within the N-word on condition that the
L-sub-word always contains k unities. Then we perform the summation over all possible values of the number i. Note
that Eq. (25) is valid due to the main Proposition ♠ formulated in Subsec. C of the previous section.

FIG. 2: The distribution function $W_N(k)$ for N = 20 and different values of the parameter n shown near the curves.

In this subsection, we focus our attention on the most important limiting case $n, k, L - k \gg 1$. Straightforward
calculations using the Stirling formula give the result

$$W_L(k) = \frac{1}{\sqrt{2\pi D(L)}} \exp\left\{ -\frac{(k - L/2)^2}{2D(L)} \right\}, \tag{27}$$

with

$$D(L) = \frac{L}{4}(1 + mL), \qquad m = \frac{1}{2n} = \frac{2\mu}{N(1 - 2\mu)}. \tag{28}$$

The last equation allows one to analyze the behavior of the variance D(L) with an increase in the "diffusion time"
L. At small $mL \ll 1$ the dependence D(L) follows the classical law of Brownian diffusion, $D(L) \approx L/4$. Then, at
$mL \sim 1$, the function D(L) becomes super-linear and meets the value (20) at L = N.

Such an unusual behavior of the variance D(L) raises the question of what particular type of diffusion equation
corresponds to the nonlinear dependence D(L) in Eq. (28). In the following subsection, when solving this problem,
we will obtain the conditional probability $p^{(0)}$ of occurrence of the symbol zero after a given L-word with L < N. The
ability to find $p^{(0)}$, with some reduced information about preceding symbols being available, is very important for the
study of the self-similarity of the Markov chain (see Subsubsec. 3 of this Subsection).
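
To get a feeling for the accuracy of the Gaussian result, one can evaluate the sum in Eq. (25) directly and compare the resulting variance with Eq. (28). The sketch below does this under assumed parameter values; the helper names are hypothetical, and b(i) is taken from Eqs. (14)-(17).

```python
from math import comb, exp, lgamma


def b_values(N, mu):
    """b(i) proportional to Gamma(n+i) Gamma(n+N-i), normalized by Eq. (17)."""
    n = N * (1.0 - 2.0 * mu) / (4.0 * mu)
    log_b = [lgamma(n + i) + lgamma(n + N - i) for i in range(N + 1)]
    shift = max(log_b)
    raw = [exp(v - shift) for v in log_b]
    norm = sum(comb(N, i) * raw[i] for i in range(N + 1))
    return [v / norm for v in raw]


def subword_distribution(N, mu, L):
    """W_L(k) for L < N, evaluated term by term from Eq. (25)."""
    b = b_values(N, mu)
    return [
        sum(b[i] * comb(L, k) * comb(N - L, i - k) for i in range(k, k + N - L + 1))
        for k in range(L + 1)
    ]


def variance(dist):
    mean = sum(k * w for k, w in enumerate(dist))
    return sum((k - mean) ** 2 * w for k, w in enumerate(dist))


# Assumed illustrative parameters.
N, mu, L = 100, 0.4, 50
m = 2.0 * mu / (N * (1.0 - 2.0 * mu))
dist = subword_distribution(N, mu, L)
D_exact = variance(dist)           # variance of the sum in Eq. (25)
D_gauss = L / 4.0 * (1.0 + m * L)  # asymptotic expression, Eq. (28)
```

For these values the two variances should agree within several percent, the difference shrinking as n grows.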



2. Generalized diffusion equation 



It is quite obvious that the distribution $W_L(k)$ satisfies the equation

$$W_{L+1}(k) = W_L(k)\, p^{(0)}(k) + W_L(k - 1)\, p^{(1)}(k - 1). \tag{29}$$

Here $p^{(0)}(k)$ is the probability of occurrence of "0" after an average-statistical L-word containing k unities and $p^{(1)}(k-1)$ is the
probability of occurrence of "1" after an L-word containing k - 1 unities. At L < N, the probability $p^{(0)}(k)$ can be written
as

$$p^{(0)}(k) = \frac{1}{W_L(k)} \sum_{i=k}^{k+N-L} p_i\, b(i)\, C_L^k\, C_{N-L}^{i-k}. \tag{30}$$

The product $b(i)\, C_L^k\, C_{N-L}^{i-k}$ in this formula represents the conditional probability of occurrence of an N-word containing
i unities, the right-hand part of which, the L-sub-word, contains k unities (compare with Eqs. (25), (26)).

The product $b(i)\, C_{N-L}^{i-k}$ in Eq. (30) is a sharp function of i with a maximum at some point $i = i_0$, whereas $p_i$ obeys
the linear law (5). This implies that $p_i$ can be factored out of the summation sign, being taken at the point $i = i_0$. The
asymptotic calculation shows that the point $i_0$ is described by the equation

$$i_0 = \frac{N}{2} - \frac{L/2}{1 - 2\mu(1 - L/N)} \left(1 - \frac{2k}{L}\right). \tag{31}$$

Expression (5) taken at the point $k = i_0$ gives the desired formula for $p^{(0)}$ because

$$\sum_{i=k}^{k+N-L} b(i)\, C_L^k\, C_{N-L}^{i-k} \tag{32}$$

is obviously equal to $W_L(k)$. Thus, we have

$$p^{(0)}(k) = \frac{1}{2} + \frac{\mu L}{N - 2\mu(N - L)} \left(1 - \frac{2k}{L}\right). \tag{33}$$

Let us consider a very important fact following from Eq. (31). If the concentration of unities in the right-hand part
of the word (26) is higher than 1/2, k/L > 1/2, then the most probable concentration $(i_0 - k)/(N - L)$ of unities in
the left-hand part of this word is likewise increased, $(i_0 - k)/(N - L) > 1/2$. At the same time, the concentration
$(i_0 - k)/(N - L)$ is less than k/L,

$$\frac{1}{2} < \frac{i_0 - k}{N - L} < \frac{k}{L}. \tag{34}$$

This means that the increased concentration of unities in the L-words is necessarily accompanied by the existence of a
certain tail with an increased concentration of unities as well. We call this phenomenon macro-persistence.
An analysis performed in the next section will indicate that the correlation length $l_c$ of this tail is $\gamma N$, with $\gamma > 1$
dependent on the parameter $\mu$ only. It is evident from the above-mentioned property of isotropy of the Markov
chain that there are two correlation tails, one on each side of the L-word.
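
The closed form for $p^{(0)}(k)$ can be checked numerically against the weighted sum of Eq. (30). The following sketch assumes the form $p^{(0)}(k) = 1/2 + \mu L (1 - 2k/L)/(N - 2\mu(N - L))$ for Eq. (33) and uses hypothetical helper names and illustrative parameters.

```python
from math import comb, exp, lgamma


def b_values(N, mu):
    """b(i) proportional to Gamma(n+i) Gamma(n+N-i), normalized by Eq. (17)."""
    n = N * (1.0 - 2.0 * mu) / (4.0 * mu)
    log_b = [lgamma(n + i) + lgamma(n + N - i) for i in range(N + 1)]
    shift = max(log_b)
    raw = [exp(v - shift) for v in log_b]
    norm = sum(comb(N, i) * raw[i] for i in range(N + 1))
    return [v / norm for v in raw]


def p_zero_sum(N, mu, L, k):
    """p^(0)(k) evaluated as the weighted average of Eq. (30)."""
    b = b_values(N, mu)
    numerator = denominator = 0.0
    for i in range(k, k + N - L + 1):
        weight = b[i] * comb(N - L, i - k)   # the factor C_L^k cancels in the ratio
        p_i = 0.5 + mu * (1.0 - 2.0 * i / N)  # linear law, Eq. (5)
        numerator += p_i * weight
        denominator += weight
    return numerator / denominator


def p_zero_formula(N, mu, L, k):
    """Closed form assumed for Eq. (33)."""
    return 0.5 + mu * L / (N - 2.0 * mu * (N - L)) * (1.0 - 2.0 * k / L)


# Assumed illustrative parameters.
N, mu, L = 60, 0.35, 20
differences = [abs(p_zero_sum(N, mu, L, k) - p_zero_formula(N, mu, L, k)) for k in range(L + 1)]
```

For L = N the closed form reduces to the original rule of Eq. (5), which gives a quick consistency check.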

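By going over to the continuous limit in Eq. (29) and using Eq. (33) and the relation $p^{(1)}(k - 1) = 1 - p^{(0)}(k - 1)$,
we obtain the diffusion equation generalized to the case of the correlated Markov process,

$$\frac{\partial W}{\partial L} = \frac{1}{8} \frac{\partial^2 W}{\partial \kappa^2} - \eta(L)\, \frac{\partial}{\partial \kappa}(\kappa W), \tag{35}$$

with $\kappa = k - L/2$ and

$$\eta(L) = \frac{2\mu}{(1 - 2\mu)N + 2\mu L}. \tag{36}$$

Equation (35) has a solution of the Gaussian form, Eq. (27), with the variance D(L) satisfying the ordinary differential
equation

$$\frac{dD}{dL} = \frac{1}{4} + 2\eta(L)\, D. \tag{37}$$

Its solution with the boundary condition D(0) = 0 coincides with Eq. (28).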
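
The statement that the variance equation reproduces Eq. (28) can be checked by a direct numerical integration. The sketch below assumes the forms $dD/dL = 1/4 + 2\eta(L) D$ and $\eta(L) = 2\mu/((1 - 2\mu)N + 2\mu L)$ and uses a standard fourth-order Runge-Kutta step; the function names and step count are illustrative choices.

```python
def eta(L, N, mu):
    """Drift coefficient assumed for Eq. (36)."""
    return 2.0 * mu / ((1.0 - 2.0 * mu) * N + 2.0 * mu * L)


def integrate_variance(L_max, N, mu, steps=1000):
    """Integrate dD/dL = 1/4 + 2 eta(L) D from D(0) = 0 with classical RK4."""
    def rhs(L, D):
        return 0.25 + 2.0 * eta(L, N, mu) * D

    h = L_max / steps
    L, D = 0.0, 0.0
    for _ in range(steps):
        k1 = rhs(L, D)
        k2 = rhs(L + h / 2, D + h * k1 / 2)
        k3 = rhs(L + h / 2, D + h * k2 / 2)
        k4 = rhs(L + h, D + h * k3)
        D += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        L += h
    return D


# Assumed illustrative parameters.
N, mu = 100, 0.4
m = 2.0 * mu / (N * (1.0 - 2.0 * mu))
D_numeric = integrate_variance(L_max=N, N=N, mu=mu)
D_closed = N / 4.0 * (1.0 + m * N)   # Eq. (28) at L = N
```

At L = N the closed form also equals N/(4(1 - 2$\mu$)), the value quoted for N-words.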



3. Self-similarity of the persistent Brownian diffusion 

In this subsection we point out one of the most important properties of the Markov chain being considered, namely,
its self-similarity. Let us reduce the N-step Markov sequence by regularly (or randomly) removing some symbols and
introduce the decimation parameter $\lambda$,

$$\lambda = N^*/N < 1. \tag{38}$$

Here $N^*$ is the renormalized memory length for the reduced $N^*$-step Markov chain. According to Eq. (33), the
conditional probability $p_k^*$ of occurrence of the symbol zero after k unities among the preceding $N^*$ symbols is described
by the formula

$$p_k^* = \frac{1}{2} + \mu^* \left(1 - \frac{2k}{N^*}\right), \tag{39}$$

with

$$\mu^* = \mu\, \frac{\lambda}{1 - 2\mu(1 - \lambda)}. \tag{40}$$

The comparison of Eqs. (5) and (39) shows that the reduced chain possesses the same statistical properties as the
initial one but is characterized by the renormalized parameters $(N^*, \mu^*)$ instead of $(N, \mu)$. Thus, Eqs. (38) and (40)
determine the one-parametric renormalization of the parameters of the stochastic process defined by Eq. (5).
The astonishing property of the reduced sequence is that the variance $D^*(L)$ is invariant with respect
to the one-parametric decimation transformation (38), (40). Therefore, it coincides with the function D(L) for the
initial Markov chain:

$$D^*(L) = \frac{L}{4}\left(1 + \frac{2\mu^*}{N^*(1 - 2\mu^*)}\, L\right) = D(L), \qquad L < N^*. \tag{41}$$

We refer to the invariance of the function D(L) at L < N as the phenomenon of self-similarity. It is
demonstrated in Fig. 3 and is also discussed in Sec. IV A.
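
The invariance of D(L) under decimation amounts to the invariance of the combination $m = 2\mu/(N(1 - 2\mu))$. A small sketch, assuming the renormalization $N^* = \lambda N$, $\mu^* = \mu\lambda/(1 - 2\mu(1 - \lambda))$ and using hypothetical function names, makes this explicit.

```python
def renormalize(N, mu, lam):
    """Decimation transformation assumed for Eqs. (38) and (40)."""
    N_star = lam * N
    mu_star = mu * lam / (1.0 - 2.0 * mu * (1.0 - lam))
    return N_star, mu_star


def persistence_m(N, mu):
    """Combination m entering the variance D(L) = (L/4)(1 + m L), Eq. (28)."""
    return 2.0 * mu / (N * (1.0 - 2.0 * mu))


# Assumed illustrative parameters: N = 100, mu = 0.4, decimation lambda = 0.5.
N, mu, lam = 100, 0.4, 0.5
N_star, mu_star = renormalize(N, mu, lam)
m_initial = persistence_m(N, mu)
m_reduced = persistence_m(N_star, mu_star)
```

Two successive decimations with parameters $\lambda_1$ and $\lambda_2$ act as a single one with $\lambda_1 \lambda_2$, because the ratio $\mu/(1 - 2\mu)$ is simply multiplied by $\lambda$ at each step.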




FIG. 3: The dependence of the variance D on the tuple length L for the generated sequence with N = 100 and $\mu$ = 0.4 (solid
line) and for the decimated sequences (the parameter of decimation $\lambda$ = 0.5). Squares and circles correspond to the stochastic
and deterministic reduction, respectively. The thin solid line describes the non-correlated Brownian diffusion, D(L) = L/4.



C. Long-range diffusion, L > N 



Unfortunately, the very useful Proposition ♠ is valid for words of length $L \le N$ only and cannot be applied
to the analysis of long words with L > N. Therefore, investigating the statistical properties of long words
represents a rather challenging combinatorial problem and requires new physical approaches for its simplification.
Thus, we start this subsection by analyzing the correlation properties of the long words.



1. Correlation length 



Let us rewrite Eq. (5) in the form

$$\langle a_{i+1} \rangle = \frac{1}{2} + \mu \left( \frac{2}{N} \sum_{k=i-N+1}^{i} \langle a_k \rangle - 1 \right). \tag{42}$$

The angle brackets denote the averaging of the density of unities in some region of the Markov chain for its definite
realization. The averaging is performed over distances much greater than unity but far less than the memory length
N. Note that this averaging differs from the statistical averaging over the ensemble of realizations of the Markov
chain denoted by the bar in Eqs. (8) and (9). Equation (42) is a relationship between the average densities of unities
in two different macroscopic regions of the Markov chain, namely, in the vicinity of the (i + 1)-th element and in the
region (i - N, i). Such an approach is similar to the mean-field approximation in the theory of phase transitions
and is asymptotically exact under the condition $N \to \infty$. In the continuous limit, Eq. (42) can be rewritten in the
integral form

$$\langle a(i) \rangle = \frac{1}{2} + \mu \left( \frac{2}{N} \int_{i-N}^{i} \langle a(i') \rangle\, di' - 1 \right). \tag{43}$$

It has the obvious solution

$$\left\langle a(i) - \frac{1}{2} \right\rangle = \left\langle a(0) - \frac{1}{2} \right\rangle \exp\left(-\frac{i}{\gamma N}\right), \tag{44}$$

where $\gamma$ is determined by the relation

$$\gamma \left( \exp\left(\frac{1}{\gamma}\right) - 1 \right) = \frac{1}{2\mu}. \tag{45}$$

The last equation has a unique solution for any value of $\mu \in (0, 1/2)$.

Formula (44) shows that any fluctuation (the difference between $\langle a(i) \rangle$ and the equilibrium value $\bar{a}_i = 1/2$) is
exponentially damped at distances of the order of the correlation length $l_c$,

$$l_c = \gamma N. \tag{46}$$

Law (44) describes the phenomenon of the persistent macroscopic correlations discussed in the previous subsection.
This phenomenon is governed by both parameters, the memory length N and the persistence parameter $\mu$.
According to Eqs. (45), (46), the correlation length $l_c$ goes logarithmically to infinity with an increase in $\mu$, at $\mu \to 1/2$.
At $\mu \to 0$, the macro-persistence is broken and the correlation length tends to zero.
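
Since the relation between $\gamma$ and $\mu$ is transcendental, it has to be solved numerically. A minimal bisection sketch follows; it assumes the form $\gamma(\exp(1/\gamma) - 1) = 1/(2\mu)$ for Eq. (45), and the function name, bracketing interval, and iteration count are illustrative choices.

```python
from math import expm1


def solve_gamma(mu, iterations=200):
    """Solve gamma * (exp(1/gamma) - 1) = 1/(2 mu) for gamma by bisection."""
    if not 0.0 < mu < 0.5:
        raise ValueError("mu must lie in the persistent region (0, 1/2)")
    target = 1.0 / (2.0 * mu)

    def g(x):
        return x * expm1(1.0 / x)  # decreases monotonically from +inf to 1

    lo, hi = 0.01, 1.0
    while g(hi) > target:   # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Correlation length l_c = gamma * N (Eq. (46)) for assumed N = 100, mu = 0.4.
gamma = solve_gamma(0.4)
l_c = gamma * 100
```

The left-hand side is a monotonically decreasing function of $\gamma$ with values in $(1, \infty)$, which is why the root is unique for every $\mu$ in (0, 1/2).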

2. Correlation function 

Using the already studied correlation properties of the Markov sequence and some heuristic reasoning, one can
obtain the correlation function K(r),

$$K(r) = \overline{a_i a_{i+r}} - \bar{a}_i^2, \tag{47}$$

and then the variance D(L),

$$D(L) = \overline{\left( \sum_{i=1}^{L} a_i \right)^2} - \left( \overline{\sum_{i=1}^{L} a_i} \right)^2. \tag{48}$$

Comparing Eqs. (47) and (48) and taking into account the unbiased property of the sequence, $\bar{a}_i = 1/2$, it is easy to
derive the general relationship between the functions K(r) and D(L),

$$D(L) = \frac{L}{4} + 2 \sum_{i=1}^{L-1} \sum_{r=1}^{L-i} K(r). \tag{49}$$

Considering (49) as an equation with respect to K(r), one can find its solution given as

$$K(1) = \frac{1}{2} D(2) - \frac{1}{4}, \qquad K(2) = \frac{1}{2} D(3) - D(2) + \frac{1}{8},$$

$$K(r) = \frac{1}{2} \left[ D(r + 1) - 2D(r) + D(r - 1) \right], \quad r \ge 3. \tag{50}$$
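
The pair of relations between K(r) and D(L) can be checked by a round trip: build D(L) from an arbitrary correlator and recover the correlator back. The sketch below assumes the prefactor 2 in Eq. (49) and the three-case form of Eq. (50); all names and numbers are illustrative.

```python
def variance_from_correlator(K, L):
    """Eq. (49): D(L) = L/4 + 2 * sum_{i=1}^{L-1} sum_{r=1}^{L-i} K(r); K[0] is unused."""
    return L / 4.0 + 2.0 * sum(K[r] for i in range(1, L) for r in range(1, L - i + 1))


def correlator_from_variance(D, r):
    """Eq. (50); D is a callable returning the variance D(L)."""
    if r == 1:
        return 0.5 * D(2) - 0.25
    if r == 2:
        return 0.5 * D(3) - D(2) + 0.125
    return 0.5 * (D(r + 1) - 2.0 * D(r) + D(r - 1))


# Applying Eq. (50) to the quadratic variance of Eq. (28), with an assumed m.
m = 0.04


def quadratic(L):
    return L / 4.0 * (1.0 + m * L)


coefficients = [correlator_from_variance(quadratic, r) / m for r in (1, 2, 3, 10)]
```

With the quadratic D(L) the ratios K(r)/m come out as 1/2 at r = 1, 1/8 at r = 2, and 1/4 for larger r.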



This solution has a very simple form in the continuous limit,

$$K(r) = \frac{1}{2} \frac{d^2 D(r)}{dr^2}. \tag{51}$$

Equations (50) and (28) give the correlation function at r < N,

$$K(r) = C_r\, \frac{2\mu}{N(1 - 2\mu)} = C_r\, m, \qquad C_1 = 1/2, \quad C_2 = 1/8, \quad C_{3 \le r < N} = 1/4, \tag{52}$$

or

$$K(r) = \frac{m}{4}, \quad r \le N, \tag{53}$$

in the continuous approximation. The independence of the correlation function of r at r < N results from our choice
of the conditional probability in the simplest form (5). At r > N, the function K(r) should decrease because of the loss
of memory. Therefore, based on Eqs. (44) and (46), let us extend the correlator K(r) as an exponentially
decreasing function at r > N,


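$$K(r) = \begin{cases} \dfrac{m}{4}, & r \le N, \\[2mm] \dfrac{m}{4} \exp\left(-\dfrac{r - N}{l_c}\right), & r > N. \end{cases} \tag{54}$$

Correspondingly, the variance D(L) becomes

$$D(L) = \frac{L}{4}\left(1 + m F(L)\right), \tag{55}$$

with

$$F(L) = \begin{cases} L, & L < N, \\[2mm] 2(1 + \gamma)N - (1 + 2\gamma)\dfrac{N^2}{L} - 2\gamma^2 \dfrac{N^2}{L}\left[1 - \exp\left(-\dfrac{L - N}{\gamma N}\right)\right], & L > N. \end{cases} \tag{56}$$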



The plot of Eq. (55) for N = 100 and $\mu$ = 0.4 is shown by the solid line in Fig. 4. For comparison, the straight line
in the figure corresponds to the dependence D(L) = L/4 for the usual Brownian diffusion without correlations (for
$\mu$ = 0). It is clearly seen that the plot of the variance (55) contains two qualitatively different portions. One of them,
at $L \lesssim N$, is the super-linear curve that moves away from the line D = L/4 with an increase of L as a result of the
persistence. For $L \gg N$, the plot D(L) reaches the linear asymptotics,

$$D(L) \cong \frac{L}{4}\left(1 + 2(1 + \gamma)\, mN\right). \tag{57}$$

This phenomenon can be explained as a result of diffusion in which each practically independent step $\sim D^{1/2}(N + l_c)$
of wandering represents a path traversed during the characteristic "fluctuating time" $\Delta L \sim (N + l_c)$. Since these
steps of wandering are quasi-independent, the distribution function $W_L(k)$ is Gaussian not only at L < N (see
Eq. (27)) but also in the case $L \gg N, l_c$.

Note that the above-mentioned phenomenon of self-similarity relates only to the portion L < N of the curve
D(L). Since the decimation procedure leads to a decrease of the parameter $\mu$ (see Eq. (40)), the plot of the asymptotics
(57) for the reduced sequence at $L \gg N^*$ goes below the plot for the initial chain.
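
The two portions of the variance curve can be reproduced with a few lines of code. The sketch below assumes the piecewise form of F(L) in Eq. (56) as reconstructed from the exponentially extended correlator; the value $\gamma$ = 2.32 is an approximate root of Eq. (45) for $\mu$ = 0.4 and, like the function names, is used purely for illustration.

```python
from math import exp


def F(L, N, gamma):
    """Piecewise function assumed for Eq. (56)."""
    if L <= N:
        return float(L)
    ratio = N * N / L
    decay = 1.0 - exp(-(L - N) / (gamma * N))
    return 2.0 * (1.0 + gamma) * N - (1.0 + 2.0 * gamma) * ratio - 2.0 * gamma ** 2 * ratio * decay


def variance(L, N, mu, gamma):
    """D(L) of Eq. (55)."""
    m = 2.0 * mu / (N * (1.0 - 2.0 * mu))
    return L / 4.0 * (1.0 + m * F(L, N, gamma))


def asymptote(L, N, mu, gamma):
    """Linear asymptotics of Eq. (57), valid for L much greater than N."""
    m = 2.0 * mu / (N * (1.0 - 2.0 * mu))
    return L / 4.0 * (1.0 + 2.0 * (1.0 + gamma) * m * N)


# Assumed illustrative parameters (gamma is an approximate value, see above).
N, mu, gamma = 100, 0.4, 2.32
curve = [(L, variance(L, N, mu, gamma)) for L in (10, 100, 1_000, 10_000)]
```

The function F(L) is continuous at L = N, so the super-linear portion joins the long-word portion without a jump.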



IV. RESULTS OF NUMERICAL SIMULATIONS AND APPLICATIONS 



In this section, we support the analytical results obtained above by numerical simulations of the Markov chain with
the conditional probability Eq. (5). Besides, the properties of the studied binary N-step Markov chain are compared
with those of natural objects, specifically of the coarse-grained written and DNA texts.



A. Numerical simulations of the Markov chain 



The first stage of the construction of the N-step Markov chain was the generation of the initial N non-correlated
symbols, zeros and unities, identically distributed with equal probabilities 1/2. Each subsequent symbol was then
added to the chain with the conditional probability determined by the previous N symbols in accordance with Eq. (5).
Then we numerically calculated the variance D(L) by means of Eq. (8). The circles in Fig. 4 represent the calculated
variance D(L) for the 100-step Markov chain generated at $\mu$ = 0.4. A very good agreement between the analytical
result, Eq. (55), and the numerical simulation can be observed.

FIG. 4: The numerical simulation of the dependence D(L) for the generated sequence with N = 100 and $\mu$ = 0.4 (circles). The
solid line is the plot of function Eq. (55) with the same values of N and $\mu$.

The numerical simulation was also used for the demonstration of Proposition ♠ (Fig. 1) and the self-similarity
property of the Markov sequence (Fig. 3). The squares in Fig. 3 represent the variance D(L) for the sequence
obtained by the stochastic decimation of the initial Markov chain (solid line) where each symbol was omitted with
the probability 1/2. The circles in this figure correspond to the regular reduction of the sequence by removing every
second symbol.

Finally, the numerical simulations have allowed us to make sure that we are able to determine the parameters
N and $\mu$ of a given binary sequence. We generated Markov sequences with different parameters N and $\mu$ and
determined numerically the corresponding curves D(L). Then we solved the inverse problem of the reconstruction of the
parameters N and $\mu$ by analyzing the curves D(L). The reconstructed parameters were always in good agreement
with their prescribed values. In the next subsections we apply this ability to the treatment of the statistical properties
of literary and DNA texts.



B. Literary texts

It is well known that the statistical properties of coarse-grained texts written in any language show a remarkable
deviation from random sequences. In order to check the applicability of the theory of the binary N-step Markov
chains to literary texts we resorted to the procedure of coarse graining by the random mapping of all characters of
the text onto a binary set of symbols, zeros and unities. The statistical properties of the coarse-grained texts depend,
but not significantly, on the kind of mapping. This is illustrated by the curves in Fig. 5 where the variance D(L) for
five different kinds of the mapping of the Bible is presented. Usually, the random mapping leads to nonequal numbers of
unities and zeros, $k_1$ and $k_0$, in the coarse-grained sequence. A particular analysis shows that the variance D(L)
gets the additional multiplier

$$\frac{4 k_0 k_1}{(k_0 + k_1)^2} \tag{58}$$

in this biased case. In order to derive the function D(L) for the non-biased sequence, we divided the numerically
calculated value of the variance by this multiplier.

The study of different written texts has shown that all of them feature pronounced persistent correlations.
This is demonstrated by Fig. 6, where five variance curves go significantly higher than the straight line D = L/4. However,
it should be emphasized that, regardless of the kind of mapping, the initial portions, L < 80, of the curves correspond
to a slight anti-persistent behavior (see insert to Fig. 7). Moreover, for some inappropriate kinds of mapping (e.g.,
when all vowels are mapped onto the same symbol) the anti-persistent portions can reach the values of L ~ 1000. In
order to avoid this problem, all the curves in Fig. 6 are obtained for the definite representative mapping: (a-m) → 0;
(n-z) → 1.
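
The representative mapping and the bias correction can be sketched as follows. Two points are assumptions of this sketch rather than statements of the paper: characters other than the letters a-z are simply skipped, and the bias multiplier is taken in the form $4 k_0 k_1/(k_0 + k_1)^2$. The function names are hypothetical.

```python
def coarse_grain(text):
    """Map letters a-m to 0 and n-z to 1; other characters are skipped (assumption)."""
    bits = []
    for ch in text.lower():
        if "a" <= ch <= "m":
            bits.append(0)
        elif "n" <= ch <= "z":
            bits.append(1)
    return bits


def bias_multiplier(bits):
    """Multiplier 4 k0 k1 / (k0 + k1)^2 assumed for the biased sequence."""
    k1 = sum(bits)
    k0 = len(bits) - k1
    return 4.0 * k0 * k1 / (k0 + k1) ** 2


sample = coarse_grain("Alice was beginning to get very tired")
correction = bias_multiplier(sample)
# A measured variance would be divided by `correction` to compare with the unbiased theory.
```

The multiplier equals unity for a balanced sequence and decreases as the numbers of zeros and unities become more unequal.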
FIG. 5: The dependence D(L) for the coarse-grained text of the Bible obtained by means of five different kinds of random
mapping.




FIG. 6: The dependence D{L) for the coarse-grained texts of collection of works on the computer science (m = 2.2-10"^, solid 
line), Bible in Russian (m = 1.9 • 10~^, dashed line), Bible in English (m — 1.5 • 10~^, dotted line), "History of Russians in the 
20-th Century" by Oleg Platonov (m = 6.4- 10~*, dash-dotted line), and "Alice's Adventures in Wonderland" by Lewis Carroll 
(m = 2.7 • 10"", dash-dot-dotted line). 

Thus, persistence is a common property of the binary N-step Markov chains and the coarse-grained written
texts at large scales. Moreover, the written texts as well as the Markov sequences possess the property of
self-similarity. Indeed, the curves in Fig. 7 obtained from the text of the Bible with different levels of deterministic
decimation demonstrate the self-similarity. Presumably, this property is the mathematical reflection of the well-known
hierarchy in linguistics: letters → syllables → words → sentences → paragraphs → chapters → books → collected
works.

All the above-mentioned circumstances allow us to suppose that our theory of the binary N-step Markov chains can
be applied to the description of the statistical properties of the texts of natural languages. However, in contrast to the
generated Markov sequence (see Fig. 4), where the full length M of the chain is far greater than the memory length N,
the coarse-grained texts described by Fig. 6 are of relatively short length $M \lesssim N$. In other words, the coarse-grained
texts are similar not to the Markov chains but rather to some non-stationary short fragments. This implies that each
of the written texts is correlated throughout the whole of its length. Therefore, for the written texts, it is impossible
to observe the second portion of the curve D(L) parallel (in the log-log scale) to the line D(L) = L/4, similar to that
shown in Fig. 4. As a result, one cannot define the values of both parameters N and $\mu$ for the coarse-grained
texts. The analysis of the curves in Fig. 6 can give the combination $m = 2\mu/N(1 - 2\mu)$ only (see Eq. (28)). Perhaps
this particular combination is the real parameter governing the persistent properties of the literary texts.

FIG. 7: The dependence of the variance D on the tuple length L for the coarse-grained text of the Bible (solid line) and for the
decimated sequences with different parameters $\lambda$: $\lambda$ = 3/4 (squares), $\lambda$ = 1/2 (stars), and $\lambda$ = 1/256 (triangles). The insert
demonstrates the anti-persistent portion of the D(L) plot for the Bible.

We would like to note that the origin of the long-range correlations in literary texts is hardly related to the
grammatical rules, as has been claimed in the literature. At short scales, L < 80, where the grammatical rules are in fact applicable,
the character of correlations is anti-persistent, whereas semantic correlations lead to the global persistent behavior of
the variance D(L) throughout the whole of a literary text.

The numerical estimation of the persistence parameter m and the characterization of languages and of different
authors by means of this parameter can be regarded as a new and intriguing problem of linguistics. For instance, the
unprecedentedly low value of m for the very inventive work by Lewis Carroll, as well as the closeness of m for the
English and Russian versions of the Bible, are of certain interest.

It should be noted that there exist special kinds of short-range correlated texts which can be specified by both of
the parameters, N and μ. For example, all dictionaries consist of families of words in which certain preferred letters
are repeated more frequently than in other parts of the dictionary. Yet another example of a short-range correlated
text is any lexicographically ordered list of words. The analysis of written texts of this kind is given below.

C. Dictionaries 

As an example, we have investigated the statistical properties of the coarse-grained alphabetical list of the most
frequently used 15462 English words. In contrast to other texts, the statistical properties of the coarse-grained
dictionaries are very sensitive to the kind of mapping. If one uses the above-mentioned mapping, (a-m) → 0, (n-z) → 1,
a behavior of the variance D(L) similar to that shown in Fig. 6 is obtained. The particular construction
of the dictionary manifests itself if the preferred letters in neighboring families of words are mapped onto
different symbols. The variance D(L) for the dictionary coarse-grained by means of such a mapping is shown by circles
in Fig. 8. It is clearly seen that the graph of the function D(L) consists of two portions, similarly to the curve in Fig. 4
obtained for the generated N-step Markov sequence. Fitting the curve in Fig. 8 by the function (55) (solid line
in Fig. 8) yields the parameter values N = 180 and μ = 0.44. The parameter γ given by Eq. (53) is around
3.9. Note that the characteristic fluctuation length N(1 + γ) for these N and γ is nearly 900. This value corresponds
qualitatively to the length of a family of words in the dictionary.
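
The coarse-graining step and the estimate of the fluctuation length quoted above are simple enough to be written out. The sketch below is only illustrative: the function names are ours, and the decision to skip every character that is not a Latin letter (spaces, digits, punctuation) is an assumption, since the treatment of such characters is not specified here.

```python
def coarse_grain_letters(text):
    """Map the letters a-m onto 0 and n-z onto 1.

    Characters outside a-z (after lower-casing) are skipped; this is an
    assumption made for the sketch.
    """
    bits = []
    for ch in text.lower():
        if "a" <= ch <= "m":
            bits.append(0)
        elif "n" <= ch <= "z":
            bits.append(1)
    return bits


def fluctuation_length(N, gamma):
    """Characteristic fluctuation length N(1 + gamma)."""
    return N * (1 + gamma)


if __name__ == "__main__":
    print(coarse_grain_letters("Markov chain"))
    # Fitted values quoted in the text: N = 180, gamma around 3.9.
    print(fluctuation_length(180, 3.9))
```

With N = 180 and γ ≈ 3.9 one obtains 180 × 4.9 = 882, which is the value referred to as "nearly 900".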

FIG. 8: The dependence D(L) for the coarse-grained alphabetical list of 15462 English words (circles). The solid line is the
plot of the function Eq. (55) with the fitting parameters N = 180 and μ = 0.44.



D. DNA texts 

It is known that any DNA text is written by four "characters", specifically by adenine (A), cytosine (C), guanine
(G), and thymine (T). Therefore, there are three nonequivalent types of the DNA text mapping onto one-dimensional
binary sequences of zeros and unities. The first of them is the so-called purine-pyrimidine rule, {A,G} → 0, {C,T} → 1.
The second one is the hydrogen-bond rule, {A,T} → 0, {C,G} → 1. And, finally, the third is {A,C} → 0, {G,T} → 1.
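
The count of three mappings follows from simple combinatorics: four characters can be split into two pairs in 6/2 = 3 ways, once relabelling 0 ↔ 1 is regarded as equivalent. A minimal Python sketch of the three rules is given below. The dictionary keys, the function name `coarse_grain_dna`, and the choice to skip any symbol other than A, C, G, T are our assumptions for illustration.

```python
# The three nonequivalent binary mappings of the DNA alphabet.
DNA_MAPPINGS = {
    "purine-pyrimidine": {"A": 0, "G": 0, "C": 1, "T": 1},
    "hydrogen-bond": {"A": 0, "T": 0, "C": 1, "G": 1},
    "third": {"A": 0, "C": 0, "G": 1, "T": 1},
}


def coarse_grain_dna(sequence, rule):
    """Convert a DNA string into a list of 0/1 symbols under the given rule.

    Symbols other than A, C, G, T (for example ambiguity codes) are skipped;
    this is an assumption of the sketch.
    """
    table = DNA_MAPPINGS[rule]
    return [table[base] for base in sequence.upper() if base in table]


if __name__ == "__main__":
    for rule in DNA_MAPPINGS:
        print(rule, coarse_grain_dna("ACGTTGCA", rule))
```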




FIG. 9: The dependence D(L) for the coarse-grained DNA text of Bacillus subtilis, complete genome, for three nonequivalent
kinds of the mapping. Solid, dashed, and dash-dotted lines correspond to the mappings {A,G} → 0, {C,T} → 1 (the parameter
m = 4.1 · 10^-2), {A,T} → 0, {C,G} → 1 (m = 2.5 · 10"^), and {A,C} → 0, {G,T} → 1 (m = 1.5 · 10"^), respectively.

By way of example, the variance D(L) for the coarse-grained text of Bacillus subtilis, complete genome
(ftp://ftp.ncbi.nih.gov/genomes/bacteria/bacillus_subtilis/NC_000964.gbk), is displayed in Fig. 9 for all possible types
of mapping. One can see that the persistent properties of DNA are more pronounced than those of the written texts and that,
contrary to the written texts, the D(L) dependence for DNA does not exhibit anti-persistent behavior at small
values of L. However, as for the written texts, the D(L) curve for DNA does not contain the linear portion
given by Eq. (pT\). This suggests that the DNA chain is not a stationary sequence. In this connection, we would like
to point out that the DNA texts represent a collection of extended non-coding regions interrupted by small coding
regions (see, for example, [||). According to Fig. 9, the coding regions do not interrupt the correlation between the
non-coding areas, and the DNA system is fully correlated throughout its whole length.

The noticeable deviation of the different curves in Fig. 9 from each other demonstrates, in our opinion, that the DNA
texts are much more complex objects than the written ones. Indeed, the different kinds of mapping
reveal and emphasize various types of attractive physical correlations between the nucleotides in DNA, such as the
strong purine-purine and pyrimidine-pyrimidine persistent correlations (the upper curve) and the correlations caused
by the weaker attraction A↔T and C↔G (the middle curve).



V. CONCLUSION 



Thus, we have developed a new approach for the description of strongly correlated one-dimensional systems.
A simple, exactly solvable model of the uniform binary N-step Markov chain is presented. The memory length
N and the parameter μ of the persistent correlations are the two parameters of our theory. Usually, the correlation
function K(r) is employed as the input characteristic for the description of correlated random systems. Yet the
function K(r) describes not only the direct interconnection of the elements a_i and a_{i+r} but also takes into account
their indirect interaction via other elements. Since our approach operates with the "origin" parameters N and μ, we
believe that it allows us to disclose the intrinsic properties of the system which provide the correlations between the
elements.
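
For illustration, a chain of the kind summarized above can be generated numerically. The sketch below assumes that the probability of a unity at the next site is 1/2 + μ(2k/N − 1), where k is the number of unities among the preceding N symbols. This specific linear form, the random initial window, and the function name `generate_chain` are our assumptions, chosen to be consistent with the verbal definition of the model (a linear dependence on k, with μ > 0 corresponding to persistence); the sketch is not a reproduction of the simulations reported in this paper.

```python
import random
from collections import deque


def generate_chain(M, N, mu, seed=None):
    """Generate M symbols of a binary N-step Markov chain (illustrative).

    Assumed rule: P(next = 1 | k unities in the last N symbols)
                  = 1/2 + mu * (2k/N - 1), with -1/2 < mu < 1/2.
    """
    if N < 1:
        raise ValueError("N must be a positive integer")
    if not -0.5 < mu < 0.5:
        raise ValueError("mu must lie strictly between -1/2 and 1/2")
    rng = random.Random(seed)
    # Random initial memory window; it is not part of the returned chain.
    window = deque((rng.randint(0, 1) for _ in range(N)), maxlen=N)
    k = sum(window)
    chain = []
    for _ in range(M):
        p_one = 0.5 + mu * (2 * k / N - 1)
        symbol = 1 if rng.random() < p_one else 0
        k += symbol - window[0]   # update the count before the oldest symbol drops out
        window.append(symbol)     # deque with maxlen discards the oldest symbol
        chain.append(symbol)
    return chain


if __name__ == "__main__":
    chain = generate_chain(M=50, N=10, mu=0.4, seed=1)
    print("".join(map(str, chain)))
```

Only the two "origin" parameters N and μ enter the generator, which is the point made in the paragraph above: the correlations at all distances follow from them and need not be prescribed separately.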

We have demonstrated the applicability of the suggested theory to different kinds of correlated stochastic
systems. However, there exist some aspects which cannot be interpreted in terms of our two-parameter model.
Obviously, more complex models should be developed for an adequate description of real correlated systems.

We are grateful to Dr. S.V. Denisov, who drew our attention to the problem considered here, to Yu.L. Rybalko for
consultations and kind assistance in the numerical simulations, and to S.S. Mel'nik and M.E. Serbin for helpful discussions.



[1] H.E. Stanley et al., Physica A 224, 302 (1996).

[2] A. Provata and Y. Almirantis, Physica A 247, 482 (1997).

[3] A. Czirok, R.N. Mantegna, S. Havlin, and H.E. Stanley, Phys. Rev. E 52, 446 (1995).

[4] W. Li, Europhys. Lett. 10, 395 (1989).

[5] R.F. Voss, in: Fundamental Algorithms in Computer Graphics, ed. R.A. Earnshaw (Springer, Berlin, 1985), p. 805.

[6] M.F. Shlesinger, G.M. Zaslavsky, and J. Klafter, Nature (London) 363, 31 (1993).

[7] C.V. Nagaev, Theor. Probab. & Appl. 2, 389 (1957) (in Russian).

[8] I. Kanter and D.F. Kessler, Phys. Rev. Lett. 74, 4559 (1995).

[9] M.I. Tribelsky, Phys. Rev. Lett. 87, 070201 (2002).

[10] A. Schenkel, J. Zhang, and Y.C. Zhang, Fractals 1, 47 (1993).


